WIP #7907

Closed · wants to merge 1 commit into from

Conversation

@tcsc commented Aug 13, 2021

This is intended as a soundness check on work to address issue #7606.

Goal

To have a node (SSH for now, but applicable to other services as well) notice when the proxy configuration changes and adapt automatically.

Context

Imagine you have a cluster with a node connected via a tunnel on port 3024:

 ┌──────┐            ┌───────┐            ┌──────┐
 │ Node ├────────────► Proxy ├────────────► Auth │
 └──────┘            └───────┘            └──────┘
  auth_servers:       public_addr:
  - example.com:3080    example.com:3080
                      tunnel_public_address:
                        example.com:3024

Now imagine you change the proxy config so that tunnel_public_address is example.com:4024. You either restart the proxy, or reload the proxy config with a SIGHUP.

...and the node doesn't reconnect to the proxy because, even though the auth_server address hasn't changed, the node has cached the old tunnel_public_address and keeps trying to connect to that.

You can always restart the node to have it reconnect, but that would be a pain if you have thousands of nodes.

This PR's approach

I initially attacked this by trying to leverage the discovery & re-connection machinery that already exists in the node (i.e. the AgentPool). Originally, the agent pool was given the tunnel_public_address that the node discovered on startup and always attempted to reconnect to that.

The WIP code in this PR removes that static config and replaces it with a callback that furnishes the AgentPool with a list of potential proxy addresses. The proof-of-concept handler polls webapi/find to get the current tunnel_public_address value and adds a new Agent to the AgentPool for that address if necessary. This is based on the assumption that once the new address is in the agent pool, re-connection will fall out of the existing machinery automatically.
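
For concreteness, the callback shape I have in mind is roughly the sketch below. This is not the actual diff: resolveProxyAddrs and the findResponse JSON field names are illustrative, and the real webapi/find response shape would have to come from the webclient package.

```go
// Sketch only: resolveProxyAddrs is the kind of callback the AgentPool
// could be handed. Given the (stable) web proxy address it returns the
// current set of candidate reverse-tunnel addresses.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// findResponse models only the part of the webapi/find payload we care
// about here. The field names are illustrative, not the real schema.
type findResponse struct {
	Proxy struct {
		TunnelPublicAddr string `json:"tunnel_public_addr"`
	} `json:"proxy"`
}

func resolveProxyAddrs(ctx context.Context, webProxyAddr string) ([]string, error) {
	url := fmt.Sprintf("https://%s/webapi/find", webProxyAddr)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var find findResponse
	if err := json.NewDecoder(resp.Body).Decode(&find); err != nil {
		return nil, err
	}
	return []string{find.Proxy.TunnelPublicAddr}, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := resolveProxyAddrs(ctx, "example.com:3080")
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	// The AgentPool would then add a new Agent for any address it is not
	// already dialling.
	fmt.Println("candidate tunnel addresses:", addrs)
}
```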

This works as expected up to a point,

...but

It turns out there is a bunch of stuff in the AgentPool that assumes the original tunnel_public_address is still available (see Agent.AccessPoint) in order to process the new connection. This makes perfect sense in the context of the proxy Discovery, in that it's designed to handle locating and adding extra proxies to the pool. It models the notion of moving a tunnel proxy to a new address somewhat less well.

This extends past the AgentPool as well. There are several places in the larger application that have a baked-in assumption that the Client wrapping the initial connection to the auth server is special, and that the corresponding tunnel parameters will remain valid (if not actually connected) for the lifetime of the process.

We can, of course, track down all of the state information that needs updating and do so, but it is beginning to sound like I am making this change at the wrong level.

Alternative approaches

Tunnel watchdog

We could detect the address change (e.g. by polling webapi/find occasionally and noting any changes in the reported addresses) and automatically restart the entire teleport instance when tunnel_public_address changes, rather than trying to manage the change internally.

This is starting to look like a sensible option, as so much of the teleport process assumes that the initial tunnel location will not change. I've started working on a proof-of-concept implementation that polls webapi/find, but this becomes problematic when a cluster has multiple, independently configured proxies that can be hit by any given poll. Some extra work would need to be done to prevent the SSH node from restarting on what is actually a legitimate configuration.
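
Roughly, the watchdog loop would look like the sketch below, assuming a full process restart is acceptable. pollFindAddr and restart are placeholders rather than existing functions, and the single last-seen address is exactly where the multi-proxy issue above bites.

```go
// Sketch only: a watchdog loop that polls for the advertised tunnel address
// and triggers a full restart when it changes. pollFindAddr and restart are
// placeholders, not existing Teleport functions.
package main

import (
	"context"
	"log"
	"time"
)

func watchTunnelAddr(ctx context.Context, interval time.Duration,
	pollFindAddr func(context.Context) (string, error), restart func()) {

	var lastAddr string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			addr, err := pollFindAddr(ctx)
			if err != nil {
				log.Printf("find poll failed: %v", err)
				continue
			}
			if lastAddr == "" {
				lastAddr = addr
				continue
			}
			// With multiple independently configured proxies behind one
			// address, successive polls may legitimately disagree, so a real
			// implementation needs to track the full advertised set rather
			// than a single last-seen value.
			if addr != lastAddr {
				log.Printf("tunnel_public_address changed %q -> %q, restarting", lastAddr, addr)
				restart()
				return
			}
		}
	}
}

func main() {
	// Stub poller; a real one would hit webapi/find. This call blocks, so in
	// the node it would run in its own goroutine.
	poll := func(context.Context) (string, error) { return "example.com:4024", nil }
	watchTunnelAddr(context.Background(), 30*time.Second, poll,
		func() { log.Println("would restart the teleport process here") })
}
```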

Refactoring out

A larger-scale refactor of the AgentPool (and friends) to try to make it easier to handle this "proxy moved" situation. This would essentially mean that the initial client connection to the proxy becomes just another client in the pool, and can be rotated out of the AgentPool's proxy set like any other.
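
The shape I'm imagining is roughly the sketch below; the agent/pool types and the reconcile logic are hypothetical, not the current AgentPool API.

```go
// Sketch only: the agent and pool types here are hypothetical stand-ins,
// not the existing AgentPool API.
package main

import (
	"fmt"
	"sync"
)

// agent stands in for a reverse-tunnel agent dialling one proxy address.
type agent struct{ addr string }

func (a *agent) Close() { fmt.Println("closing agent for", a.addr) }

// pool keeps one agent per known proxy address. The connection made at
// startup is stored the same way as any agent discovered later, so it can
// be rotated out when its address disappears from the resolved set.
type pool struct {
	mu     sync.Mutex
	agents map[string]*agent
}

// setAddrs reconciles the pool against the latest resolved address set:
// start agents for new addresses, close agents whose address is gone.
func (p *pool) setAddrs(addrs []string) {
	p.mu.Lock()
	defer p.mu.Unlock()

	want := make(map[string]bool, len(addrs))
	for _, a := range addrs {
		want[a] = true
		if _, ok := p.agents[a]; !ok {
			p.agents[a] = &agent{addr: a} // a real agent would start dialling here
		}
	}
	for addr, ag := range p.agents {
		if !want[addr] {
			ag.Close()
			delete(p.agents, addr)
		}
	}
}

func main() {
	p := &pool{agents: map[string]*agent{}}
	p.setAddrs([]string{"example.com:3024"}) // the initial connection, nothing special about it
	p.setAddrs([]string{"example.com:4024"}) // tunnel_public_address moved; the old agent is rotated out
}
```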

....?

What do I want out of this discussion?

Basically, confirmation of whether or not the AgentPool is the correct place to try to implement this change, and, if it's not, some alternative approaches to the problem.

tcsc commented Aug 16, 2021

@awly, @fspmarshall suggested you might be able to lend an opinion here. Any suggestions gratefully accepted.

tcsc commented Aug 18, 2021

Going with the higher-level restart for now.

tcsc commented Sep 13, 2021

Closed in favour of #8102

@tcsc closed this Sep 13, 2021
@zmb3 deleted the tcsc/issue7606 branch April 26, 2023